Food Accessibility in U.S. Counties

Author

Tenzin Sherpa

The Issue

Maintaining a healthy and consistent diet makes for our most basic and important needs. Access to food sources, however, may be limited for many people, leading to issues of food insecurity. In 2010, 14.5% (17.2 million households) in the U.S. were food insecure, 5.4% of which had very low food security [1].

Many are left to ask:
“Where’s my next meal going to come from?”

A potential factor that plays a role in determining whether households have a good and consistent supply of nutritious food may be their accessibility to food. Food accessibility is associated with the proximity to grocery stores, food markets, and other food sources. This project looks at the scope of food accessibility for households in different counties. The project also covers the effect of food environment factors such as income, population demographics, and the availability of nutrition programs on food accessibility and food insecurity.

Data

The project uses the Food Environment Atlas Data from 9/10/2020 (Economic Research Service, n.d.). The dataset has been compiled by the US Department of Agriculture (USDA) Economic Research Service (ERS) to study factors that affect food choices and the accessibility of healthy foods in communities. The information in the dataset had been aggregated from various reports from the USDA, the Bureau of the Census, the U.S. Department of Commerce, and more for the years 2006 to 2019. Food accessibility population data, which is the focus of the project, were given for 2010 and 2015.

The dataset contains county-level information on food environment factors such as access to grocery stores/supermarkets/restaurants, local food sales, food prices, food assistance programs like SNAP (Supplemental Nutrition Assistance Program), National School Lunch Program, etc., socioeconomic characteristics, and some health/physical activities. State-level information on household food insecurity is also given. The dataset contained data on 3143 counties, each uniquely identified with their FIPS code.

To enable geographic visualizations and analysis, an additional file from the ArcGIS Hub containing geographic boundary shapes is used.

Preprocessing

The information in the dataset was stored in separate worksheets, grouped according to the category of the features. Since there was a large number of features, the first step was to perform feature reduction. This was done by manually selecting the features that were related to the project’s questions and goals. Features that contained very granular information were also omitted in order to simplify analysis as well as to obtain more generalized insights.

The dataset and boundary shape file was also compared to check if county FIPS codes and names matched for all observations. From 2010 to 2015, there had been changes to some of the county names and FIPS codes which caused discrepancies between the two datasets. These counties were identified and modified to reflect the changes and match all observations. This brought the number of observations from 3143 to 3142 counties.

For columns with a low number of missing values, imputation was done using the median of those features. This was done while looking at variable distributions to ensure that imputation does not significantly change their distributions. However, median imputation was not appropriate for some 2015 variables that had a larger number of missing values. For these variables, missing values were replaced with numbers from the corresponding feature for 2010.

Overall, the data preprocessing involved merging data, selecting relevant variables and imputing missing values while exploring the nature of the variables. The cleaned dataset was then saved to a new file that is used for analysis.

Visualizations

Using the cleaned dataset, some variables of interest were explored by producing visualizations.

Food Accessibility

To gauge the level of food accessibility in counties, the dataset provides the percentage of the population with low food accessibility. For this dataset, low accessibility populations are considered to be those “living more than 1 mile from a supermarket or large grocery store if in an urban area, or more than 10 miles from a supermarket or large grocery store if in a rural area” [2].

Code
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import plotly.express as px
import plotly.io as pio


from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
Code
df = pd.read_pickle(r'./../data/AtlasCleaned.pkl')

sns.set(style='darkgrid')
def plt_dis(c):
    f = sns.displot(data=df, x=c, height=4, aspect=10/8.27, bins=20)
    plt.xlim([0, 100])
    plt.ylabel("Number of Counties", alpha=0.8)
    plt.xlabel("Low Food Access Pop. %", alpha=0.8)
    plt.show()

plt_dis('PCT_LACCESS_POP10')

Figure 1: Distribution of County % Population with Low Food Accessibility

The distribution of the variable reflects an interesting characteristic of food accessibility in the U.S. The two distinct modes of distribution suggest that counties can fall under two distinct groups based on the extent of food accessibility. For the majority of the counties, <40% of the population has low food accessibility, while for around 130 counties, the entire population (reaching 100%) has low accessibility to food sources.

Geographic Visualization

Code
df = pd.read_pickle(r'./../data/AtlasCleaned.pkl')

import json
with open(r'./../src/USA_Counties_(Generalized).geojson') as f:
    tract = json.load(f)

def plot_chlpth(df, x, label, header):
  fig = px.choropleth(df, locations='FIPS', color=x, color_continuous_scale="Viridis",
                        range_color=(0, 100), geojson=tract, featureidkey = "properties.FIPS",
                        scope="usa", hover_data=['State', 'County'],
                        labels=label, title=header)
  fig.update_layout(legend=dict(orientation="h", yanchor="top",y=1.02,xanchor="right",x=1))
  fig.show()

plot_chlpth(df, 'PCT_LACCESS_POP10', {'PCT_LACCESS_POP10':'Popn. %'}, None)

Figure 2: Low Food Accessibility Population Percentages in 2010

The choropleth shows that low food accessibility counties are concentrated in the Midwest, Southwest, and West regions of the country. Since food accessibility is defined based on the proximity of households to food stores, the location of the low-access counties could be the product of geographic conditions and living conditions in remote regions.

Food and Vehicle Access

The dataset measures food accessibility based on the proximity of residential areas to food stores. However, it is important to note that the availability of vehicles can also play a significant role in determining food accessibility. While some populations may live far from stores, having a car or access to transportation can help cover greater distances to purchase food.

Code
def plt_dis(c):
    f = sns.displot(data=df, x=c, y='PCT_LACCESS_POP10', height=4, aspect=10/8.27, bins=10)
    plt.xlim([-5, 100])
    plt.ylabel("Low Food Access Pop. %", alpha=0.9)
    plt.xlabel("Household % with Low Food Access & No Vehicle", alpha=0.8)
    plt.show()

plt_dis('PCT_LACCESS_HHNV10')

Figure 3: Low Food Access Population and Household Vehicle Availability

To take account of this additional variable, we look at the density plot of household vehicle availability and county food accessibility. The blocks farther away from the axes represent the areas where both food accessibility and vehicle availability for the populations living there are low. These counties are likely to have the biggest challenges of food accessibility.

In other counties, although food accessibility is bad, most households have access to vehicles (shown by the few dark areas). For most counties, 0-40% have low food accesibility, but only <10% of the households with low food access do not have vehicles. Additionally, the lower percentage of households having vehicles is seen for some of the counties where larger proportion of populations did not have good food access. This could mean that the problem of food accessibility in these counties may be lower than depicted earlier.

Income and Food Access

Income is another crucial factor that can impact access to food. Low-income groups may be more sensitive to food prices and the lack of food availability in close proximity to their homes. The scatter plot charts the population with low food accessibility and populations with both low income and low food access. Here, low income is defined as “annual family income of less than or equal to 200 percent of the Federal poverty threshold based on family size.” [3]

Code
sns.set(color_codes=True)
f, ax = plt.subplots(figsize=(6, 5))
sns.scatterplot(data=df, x='PCT_LACCESS_POP10', y='PCT_LACCESS_LOWI10')
ax.set_ylabel("County Population % with Low Access and Low Income", size = 12, alpha=0.9)
ax.set_xlabel("County Population % with Low Food Access",size = 12, alpha=0.9)
ax.tick_params(axis='both', which='major', labelsize=14)
plt.xlim([0, 102])
xpoints = ypoints = ax.get_xlim()
ax.plot(xpoints, ypoints, linestyle='--', color='cornflowerblue', lw=1, scalex=False, scaley=False)
f.show()

Figure 4: Low Food Access Population and Household Vehicle Availability

The plot shows that the data points always fall below the identity function line. So, the proportion of population with both low food access and low incom is less than the population percentage with low food access. This suggests that not whole of the population with low food access have low income (as defined for the datset). However, the plot also shows that counties with worse food access have greater percentages of people with low access and low income. Thus, income can be correlated with food accessibility. Furthermore, the variation in these percentage values increases as accessibility worsens. This is also reminiscent of the variation in household availability for counties with bad food access.

Demographic Characteristics

In counties where >80% of the population has low food accessibility, the majority of the population is white, followed by Hispanic and American Indian / Alaska Native.

Code
df_race = pd.DataFrame(df, columns=['FIPS', 'State', 'County', 'PCT_LACCESS_POP10','PCT_NHWHITE10',
                                    'PCT_NHBLACK10', 'PCT_HISP10', 'PCT_NHASIAN10','PCT_NHNA10',
                                    'PCT_NHPI10', 'PCT_LACCESS_POP15', 'PCT_LACCESS_WHITE15',
                                    'PCT_LACCESS_BLACK15', 'PCT_LACCESS_HISP15','PCT_LACCESS_NHASIAN15', 
                                    'PCT_LACCESS_NHNA15', 'PCT_LACCESS_NHPI15','PCT_LACCESS_MULTIR15'])
# df_race.rename(columns = lambda x : str(x)[:-2], inplace=True)
# df_race.rename(columns={'PCT_NHWHITE10':'WHITE','PCT_NHBLACK10':'BLACK', 'PCT_HISP10':'HISPANIC', 'PCT_NHASIAN10':'ASIAN'
#   'PCT_NHNA10':'AMER INDIAN / ALASKA NATIVE','PCT_NHPI10':'HAWAIIAN / PACIFIC ISLANDER', 'PCT_LACCESS_WHITE15': 'WHITE',
#                                     'PCT_LACCESS_BLACK15':'BLACK', 'PCT_LACCESS_HISP15':'HISPANIC','PCT_LACCESS_NHASIAN15':'ASIAN', 
#                                     'PCT_LACCESS_NHNA15':'AMER INDIAN / ALASKA NATIVE', 'PCT_LACCESS_NHPI15':'HAWAIIAN / PACIFIC ISLANDER',
#                                     'PCT_LACCESS_MULTIR15':'MULTIR'}, inplace=True)


low_access10 = df_race[df['PCT_LACCESS_POP10']>80]
# type(low_access10.mean())
print(low_access10.mean()[1:7])
# table = [low_access10.mean()[1:7],]

low_access15 = df_race[df['PCT_LACCESS_POP15']>80]
print(low_access15.mean()[8:])
PCT_NHWHITE10    84.927261
PCT_NHBLACK10     0.601197
PCT_HISP10        9.255983
PCT_NHASIAN10     0.302071
PCT_NHNA10        3.632700
PCT_NHPI10        0.031591
dtype: float64
PCT_LACCESS_WHITE15      85.049586
PCT_LACCESS_BLACK15       0.913152
PCT_LACCESS_HISP15        9.924802
PCT_LACCESS_NHASIAN15     0.342595
PCT_LACCESS_NHNA15        5.611213
PCT_LACCESS_NHPI15        0.061190
PCT_LACCESS_MULTIR15      5.215931
dtype: float64

Features’ Relationships

Next, we investigate the dependence of some variables in the dataset. First, we look at how food accessibility variables affect food insecurity feature given in the dataset. Second, we look at the presence of food assistance program compared with food accessibility levels.

Food Insecurity and Accessibility

The dataset provides state-level food insecurity information. Food-insecure households are those that “were unable, at times during the year, to provide adequate food for one or more household members because the household lacked money and other resources for food.” [4] The values measuring food insecurity were given as three-year averages. In the following model, the food insecurity percent average for 2012-2014 is used as the response variable for the training set whereas the food insecurity percent average for 2015-2017 is used as the response variable for the test set. The data frame used here was created by grouping the rows by state and taking the average. Thus, the data frame contains 51 rows with state-level information.

Code
df.drop(labels=['LACCESS_POP10', 'LACCESS_POP15', 'PCH_LACCESS_POP_10_15', 'LACCESS_LOWI10', 'LACCESS_LOWI15',
                    'PCH_LACCESS_LOWI_10_15', 'LACCESS_HHNV10', 'LACCESS_HHNV15', 'PCH_LACCESS_HHNV_10_15',
                    'LACCESS_SNAP15', 'LACCESS_CHILD10', 'LACCESS_CHILD15', 'LACCESS_CHILD_10_15', 
                    'LACCESS_SENIORS10', 'LACCESS_SENIORS15', 'PCH_LACCESS_SENIORS_10_15', 'LACCESS_WHITE15', 
                    'LACCESS_BLACK15', 'LACCESS_HISP15', 'LACCESS_NHASIAN15', 'LACCESS_NHNA15', 
                    'LACCESS_NHPI15', 'LACCESS_MULTIR15'], axis=1, inplace=True)

df_avg = df.groupby(['State']).mean()

df10_cols = ['PCT_LACCESS_POP10', 'PCT_LACCESS_LOWI10', 'PCT_LACCESS_HHNV10',
             'PCT_LACCESS_CHILD10','PCT_LACCESS_SENIORS10', 'PCT_FREE_LUNCH10', 'PCT_REDUCED_LUNCH10',
             'PCT_NHWHITE10', 'PCT_NHBLACK10', 'PCT_HISP10', 'PCT_NHASIAN10', 'PCT_NHNA10', 'PCT_NHPI10',
             'PCT_65OLDER10', 'PCT_18YOUNGER10',
             'GROCPTH11', 'SUPERCPTH11', 'CONVSPTH11', 'SPECSPTH11', 
             'SNAPSPTH12', 'FFRPTH11', 'FSRPTH11', 'PC_FFRSALES07','PC_FSRSALES07', 'PCT_SNAP12', 
             'SNAP_PART_RATE11', 'PCT_NSLP12','PCT_SBP12', 'PCT_SFSP12',
             'FDPIR12','FOODINSEC_12_14','VLFOODSEC_12_14','DIRSALES_FARMS07', 'FMRKTPTH13']


df15_cols = ['PCT_LACCESS_POP15', 'PCT_LACCESS_LOWI15', 'PCT_LACCESS_HHNV15', 
             'PCT_LACCESS_SNAP15', 'PCT_LACCESS_CHILD15', 'PCT_LACCESS_SENIORS15', 'PCT_LACCESS_WHITE15',
             'PCT_LACCESS_BLACK15','PCT_LACCESS_HISP15', 'PCT_LACCESS_NHASIAN15', 'PCT_LACCESS_NHNA15', 
             'PCT_LACCESS_NHPI15', 'PCT_LACCESS_MULTIR15', 'PCT_FREE_LUNCH15', 'PCT_REDUCED_LUNCH15', 
             'GROCPTH16', 'SUPERCPTH16', 'CONVSPTH16', 'SPECSPTH16', 'SNAPSPTH17', 'FFRPTH16', 'FSRPTH16',
             'PC_FFRSALES12', 'PC_FSRSALES12', 'PCT_SNAP17','SNAP_PART_RATE16', 'PCT_NSLP17', 'PCT_SBP17', 'PCT_SFSP17',
             'FDPIR15', 'FOODINSEC_15_17','VLFOODSEC_15_17', 'DIRSALES_FARMS12', 'FMRKTPTH18']

# Foodhub, metro 13, medhincome 15 'POVRATE15' are not here
Code
df10 = pd.DataFrame(df_avg, columns = df10_cols)
df15 = pd.DataFrame(df_avg, columns = df15_cols)

scaler = StandardScaler()

df10 = pd.DataFrame(scaler.fit_transform(df10), columns=df10.columns)
df15 = pd.DataFrame(scaler.fit_transform(df15), columns=df15.columns)

df10.rename(columns = lambda x : str(x)[:-2], inplace=True)
df15.rename(columns = lambda x : str(x)[:-2], inplace =True)

X_train = df10.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_NHWHITE', 'PCT_NHBLACK',
                       'PCT_HISP', 'PCT_NHASIAN', 'PCT_NHNA', 'PCT_NHPI', 
                       'PCT_65OLDER', 'PCT_18YOUNGER', 'VLFOODSEC_12_', 'FOODINSEC_12_'], axis=1)
y_train = df10['FOODINSEC_12_']


X_test = df15.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_LACCESS_SNAP','PCT_LACCESS_WHITE',
             'PCT_LACCESS_BLACK','PCT_LACCESS_HISP', 'PCT_LACCESS_NHASIAN', 'PCT_LACCESS_NHNA', 
             'PCT_LACCESS_NHPI', 'PCT_LACCESS_MULTIR', 'VLFOODSEC_15_', 'FOODINSEC_15_'], axis=1)
y_test = df15['FOODINSEC_15_']

X_train = X_train.drop(columns=['PCT_FREE_LUNCH', 'PCT_REDUCED_LUNCH', 'GROCPTH', 'SUPERCPTH',
       'CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES',
       'PC_FSRSALES', 'PCT_SNAP', 'SNAP_PART_RATE', 'PCT_NSLP', 'PCT_SBP',
       'PCT_SFSP', 'FDPIR', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)


X_test = X_test.drop(columns=['PCT_FREE_LUNCH', 'PCT_REDUCED_LUNCH', 'GROCPTH', 'SUPERCPTH',
       'CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES',
       'PC_FSRSALES', 'PCT_SNAP', 'SNAP_PART_RATE', 'PCT_NSLP', 'PCT_SBP',
       'PCT_SFSP', 'FDPIR', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)

import statsmodels.api as sm
X = sm.add_constant(X_train)
model = sm.OLS(y_train, X).fit()
model.summary()
print('R-squared:', model.rsquared)
print('P-values:', model.pvalues)

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)
print('Test Score:', model.score(X_test,y_test))
R-squared: 0.10953682615612048
P-values: const               1.000000
PCT_LACCESS_POP     0.025167
PCT_LACCESS_LOWI    0.023915
PCT_LACCESS_HHNV    0.592164
dtype: float64
Test Score: 0.018688402604447707

The linear regression returns very low R2 scores suggesting that the food- accessibility variables may not entirely represent the level of food insecurity in states.

Food Assistance Programs

The dataset gives several variables related to food assistance programs. Here, we look at some of these variables to see if these programs are available in areas with low food accessibility. The input variables here are Students eligible for free lunch (%), Students eligible for reduced-price lunch (%), SNAP participants (% pop), SNAP benefits per capita (% change), National School Lunch Program participants (%), School Breakfast Program participants (% children), Summer Food Service Program participants (% children), FDPIR (Food Distribution Program on Indian Reservations) sites. Regression models are created separately for 2010 and 2015 to see how the availability of food assistance programs and food accessibility are related for both years. Following the previous analysis, the same state-level data frame is used for the regression models.

Code
# create 2010 and 2015 dataframes for states
df10 = pd.DataFrame(df_avg, columns = df10_cols)
df15 = pd.DataFrame(df_avg, columns = df15_cols)

scaler = StandardScaler()

df10 = pd.DataFrame(scaler.fit_transform(df10), columns=df10.columns)
df15 = pd.DataFrame(scaler.fit_transform(df15), columns=df15.columns)

df10.rename(columns = lambda x : str(x)[:-2], inplace=True)
df10.rename(columns={'FOODINSEC_12_': 'FOODINSEC','VLFOODSEC_12_': 'VLFOODSEC'}, inplace=True)

df15.rename(columns = lambda x : str(x)[:-2], inplace =True)
df15.rename(columns={'FOODINSEC_15_': 'FOODINSEC','VLFOODSEC_15_': 'VLFOODSEC'}, inplace=True)


df10.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_NHWHITE', 'PCT_NHBLACK',
                       'PCT_HISP', 'PCT_NHASIAN', 'PCT_NHNA', 'PCT_NHPI', 
                       'PCT_65OLDER', 'PCT_18YOUNGER'], inplace=True)
df15.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_LACCESS_SNAP','PCT_LACCESS_WHITE',
             'PCT_LACCESS_BLACK','PCT_LACCESS_HISP', 'PCT_LACCESS_NHASIAN', 'PCT_LACCESS_NHNA', 
             'PCT_LACCESS_NHPI', 'PCT_LACCESS_MULTIR'], inplace=True)

# training and test datasets according to year
x10 = df10.drop(columns=['VLFOODSEC', 'FOODINSEC'], axis=1)
y10 = df10['PCT_LACCESS_POP']

x15 = df15.drop(columns=['VLFOODSEC', 'FOODINSEC'], axis=1)
y15 = df15['PCT_LACCESS_POP']

x10 = x10.drop(columns=['PCT_LACCESS_POP', 'PCT_LACCESS_LOWI', 'PCT_LACCESS_HHNV',
       'GROCPTH', 'SUPERCPTH','CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES',
       'PC_FSRSALES', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)
x15 =  x15.drop(columns=['PCT_LACCESS_POP', 'PCT_LACCESS_LOWI', 'PCT_LACCESS_HHNV',
       'GROCPTH', 'SUPERCPTH','CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES',
       'PC_FSRSALES', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)

#  2010 model
X = sm.add_constant(x10)
model = sm.OLS(y10, X).fit()
model.summary()
print("2010: ")
print('R-squared:', model.rsquared)
print('P-values:', model.pvalues)

#  2015 model
X = sm.add_constant(x15)
model = sm.OLS(y15, X).fit()
model.summary()
print("2015: ")
print('R-squared:', model.rsquared)
print('P-values:', model.pvalues)
2010: 
R-squared: 0.3607467572659099
P-values: const                1.000000
PCT_FREE_LUNCH       0.840339
PCT_REDUCED_LUNCH    0.944982
PCT_SNAP             0.725729
SNAP_PART_RATE       0.321000
PCT_NSLP             0.316383
PCT_SBP              0.618867
PCT_SFSP             0.422621
FDPIR                0.061563
dtype: float64
2015: 
R-squared: 0.46831649180961343
P-values: const                1.000000
PCT_FREE_LUNCH       0.029108
PCT_REDUCED_LUNCH    0.742560
PCT_SNAP             0.031647
SNAP_PART_RATE       0.034399
PCT_NSLP             0.378130
PCT_SBP              0.069722
PCT_SFSP             0.063479
FDPIR                0.255548
dtype: float64

For both years, the r2 scores are similar and less than 0.5. This suggests that the prevalence of food assistance programs is only partly correlated with food accessibility. Although food assistance programs exist to some extent in states with poor food accessibility, this may not be the case for all states.

Change in Food Accessibility (2010-15)

Lastly, regression models are used to analyze what factors can affect food accessibility in U.S. counties.

Note: The income and household vehicle availability variables are removed here since they are largely correlated with the food accessibility response variables.

Code
# create 2010 and 2015 dataframes for all counties
df10 = pd.DataFrame(df, columns = df10_cols)
df15 = pd.DataFrame(df, columns = df15_cols)

# Rename columns
df10.rename(columns = lambda x : str(x)[:-2], inplace=True)
df10.rename(columns={'FOODINSEC_12_': 'FOODINSEC','VLFOODSEC_12_': 'VLFOODSEC'}, inplace=True)
df15.rename(columns = lambda x : str(x)[:-2], inplace =True)
df15.rename(columns={'FOODINSEC_15_': 'FOODINSEC','VLFOODSEC_15_': 'VLFOODSEC'}, inplace=True)

# Drop columns
df10.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_NHWHITE', 'PCT_NHBLACK',
                       'PCT_HISP', 'PCT_NHASIAN', 'PCT_NHNA', 'PCT_NHPI', 
                       'PCT_65OLDER', 'PCT_18YOUNGER'], inplace=True)
df15.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_LACCESS_SNAP','PCT_LACCESS_WHITE',
             'PCT_LACCESS_BLACK','PCT_LACCESS_HISP', 'PCT_LACCESS_NHASIAN', 'PCT_LACCESS_NHNA', 
             'PCT_LACCESS_NHPI', 'PCT_LACCESS_MULTIR'], inplace=True)

# Create dataframe with differences between 2010 & 2015 values as columns
diff = pd.DataFrame()
for c in df10.columns:
    diff[c] = df15[c]-df10[c]

# Scale all columns
scaler = StandardScaler()
diff = pd.DataFrame(scaler.fit_transform(diff), columns=diff.columns)

# Drop correlating variables and create train/test data
X = diff.drop(columns=['PCT_LACCESS_POP','PCT_LACCESS_LOWI','PCT_LACCESS_HHNV'], axis=1)
y = diff['PCT_LACCESS_POP']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply linear regression
X = sm.add_constant(X_train)
model = sm.OLS(y_train, X).fit()
model.summary()
print('R-squared:', model.rsquared)
print('P-values:', model.pvalues)
R-squared: 0.02122744699035428
P-values: const                0.716920
PCT_FREE_LUNCH       0.696828
PCT_REDUCED_LUNCH    0.857769
GROCPTH              0.211986
SUPERCPTH            0.045945
CONVSPTH             0.482173
SPECSPTH             0.427779
SNAPSPTH             0.325925
FFRPTH               0.289545
FSRPTH               0.450601
PC_FFRSALES          0.763854
PC_FSRSALES          0.441492
PCT_SNAP             0.242449
SNAP_PART_RATE       0.001517
PCT_NSLP             0.035178
PCT_SBP              0.093460
PCT_SFSP             0.058786
FDPIR                0.236457
FOODINSEC            0.245323
VLFOODSEC            0.788494
DIRSALES_FARMS       0.989146
FMRKTPTH             0.007023
dtype: float64

The model used here does not produce good R2 scores. One reason for this may be that the counties for which there are significant changes are low in number and may seem like unusual values for the dataset. This is shown by the distribution plot shown in Figure 5.

Code
# Plot distribution of difference in low food access pop% 
def plt_dis(c):
    f = sns.displot(data=diff, x=c, height=4, aspect=10/8.27, bins=20)
    plt.ylabel("Number of Counties", alpha=0.8)
    plt.xlabel("Change in Low Food Access Pop. % (standardized)", alpha=0.8)
    plt.show()
plt_dis('PCT_LACCESS_POP')

Figure 5: Distribution of Change in Low Access Population (Standardized)

Now, we create a model using data for only the counties that have significant changes in food accessibility. For this, the dataset was filtered to contain only counties for which the standardized Change in Food Accessibility was greater than 1.5 and less than -1.5.

Code
# filter data for extreme cases
df_extreme = diff[diff['PCT_LACCESS_POP'] > 1.5]
df_extreme = diff[diff['PCT_LACCESS_POP'] < -1.5]

# Drop correlated variables and create train/test data
X = df_extreme.drop(columns=['PCT_LACCESS_POP','PCT_LACCESS_LOWI','PCT_LACCESS_HHNV'], axis=1)
y = df_extreme['PCT_LACCESS_POP']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear regression
print("Linear Regression")
X = sm.add_constant(X_train)
model = sm.OLS(y_train, X).fit()
model.summary()
print('R-squared:', model.rsquared)
print('P-values:', model.pvalues)

# Decision Tree
print("Decision Tree Regression")
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
dt_score = dt_model.score(X_test, y_test)
print("Decision Tree score:", dt_score)
Linear Regression
R-squared: 0.24170625736787055
P-values: PCT_FREE_LUNCH       3.299206e-01
PCT_REDUCED_LUNCH    4.821978e-01
GROCPTH              4.878081e-01
SUPERCPTH            8.757824e-01
CONVSPTH             4.228270e-01
SPECSPTH             1.969613e-01
SNAPSPTH             1.497012e-01
FFRPTH               7.908066e-01
FSRPTH               7.414277e-01
PC_FFRSALES          9.810093e-01
PC_FSRSALES          9.459537e-01
PCT_SNAP             5.318101e-01
SNAP_PART_RATE       3.859439e-01
PCT_NSLP             5.183571e-01
PCT_SBP              4.437169e-01
PCT_SFSP             6.699933e-01
FDPIR                2.183979e-08
FOODINSEC            6.302984e-01
VLFOODSEC            6.250474e-01
DIRSALES_FARMS       4.566845e-01
FMRKTPTH             5.418631e-01
dtype: float64
Decision Tree Regression
Decision Tree score: 0.4282723731862612

The R2 score for linear regression improved significantly for this data frame. The R2 further improves when using a Decision Tree Regrressor. # Conlusion